Representing Documents Using an Explicit Model of Their Similarities
Authors
Bartell
Abstract
A method is proposed for creating vector space representations of documents based on modeling target inter-document similarity values. The target similarity values are assumed to capture semantic relationships, or associations, between the documents. The vector representations are chosen so that the inner product similarities between document vector pairs closely match their target inter-document similarities. The method is closely related to the Latent Semantic Indexing approach; in fact, they are equivalent when the target similarities are derived directly from document similarities based on term co-occurrence. However, our method allows for external sources of inter-document semantic constraints to be used in the indexing, though at greater computational expense. The method is applied to three standard text databases from the information retrieval literature. On the CISI database of information science abstracts, performance (measured by precision averaged over a range of recall levels) improves by 28% compared to a weighted term-vector approach, and improves by 10% compared to Latent Semantic Indexing. Similar improvement is obtained on the Cranfield database, but no improvement is obtained for the artificial MED database of medical abstracts. The generally favorable performance suggests interesting potential for methods which explicitly modify the retrieval system to meet inter-document semantic constraints.
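The core idea in the abstract, choosing document vectors whose inner products match target similarities, can be illustrated with a small NumPy sketch. It rests on assumptions not stated in the abstract: a squared-error (Frobenius) objective, a symmetric target matrix, and a truncated eigendecomposition as the solver. The function name `fit_document_vectors` and the toy matrix are hypothetical. The second half checks the equivalence claim numerically: when the targets are the term-based similarities `A.T @ A`, the fitted vectors reproduce the same inner products as rank-k Latent Semantic Indexing.

```python
import numpy as np


def fit_document_vectors(target_sim, k):
    """Return an (n_docs, k) matrix X such that X @ X.T is the best rank-k
    positive semidefinite least-squares approximation of target_sim.

    Assumption: squared-error fit solved by truncated eigendecomposition.
    """
    # Symmetrize in case the supplied targets are only approximately symmetric.
    sym = (target_sim + target_sim.T) / 2.0
    eigvals, eigvecs = np.linalg.eigh(sym)  # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]     # indices of the k largest
    # Negative eigenvalues cannot be produced by real inner products; clip them.
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale


# Toy term-document matrix (6 terms x 4 documents); values are illustrative only.
rng = np.random.default_rng(0)
A = rng.random((6, 4))

# Targets derived directly from term co-occurrence: S[i, j] = doc_i . doc_j
S = A.T @ A
X = fit_document_vectors(S, k=2)

# Rank-2 LSI document vectors from the SVD of A, for comparison.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
lsi = Vt[:2].T * s[:2]

# The two representations may differ by sign or rotation, so compare
# their inner-product (similarity) matrices rather than the vectors.
print(np.allclose(X @ X.T, lsi @ lsi.T))
```

To use external semantic constraints, as the abstract suggests, one would pass a different target matrix to `fit_document_vectors` in place of `A.T @ A`; how such targets are constructed is not covered by this sketch.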
Similar resources
Designing a Model for Teacher Competencies in Elementary Education
Purpose: Teacher competencies in the education system are among the most influential and important issues. This importance is rooted in the critical role of teachers in educating people in a society, because the more teachers are prepared and qualified, the greater their impact on upgrading the education system. Methodology: In this regard, upstream documents, as the most extensive strategic and...
Comparison of strategic planning documents in selected public universities of the country
Abstract: The purpose of this study is to compare the official documents of strategic planning in selected public universities of the country in order to answer these two main questions: what are the differences and similarities between the formal and content elements of strategic planning documents of the selected public universities? Is there a unique design in these documents that fits the s...
Investigating Usage of Text Segmentation and Inter-passage Similarities to Improve Text Document Clustering
Measuring inter-document similarity is one of the most essential steps in text document clustering. Traditional methods rely on representing text documents using the simple Bag-of-Words (BOW) model. A document is an organized structure consisting of various text segments or passages. Such single-term analysis of the text treats the whole document as a single semantic unit and thus ignores other se...
Investigating the Role of Homograph Context Types in Determining Similarity Between Documents
Aim: Automatic information retrieval is based on the assumption that texts contain content or structural elements that can be used in word sense disambiguation and thereby improve the effectiveness of the results retrieved. Homographs are among the words requiring sense disambiguation. Depending on their roles and positions in texts, homograph contexts can be divided into different types, wit...
SMART Electronic Legal Discovery Via Topic Modeling
Electronic discovery is an interesting subproblem of information retrieval in which one identifies documents that are potentially relevant to issues and facts of a legal case from an electronically stored document collection (a corpus). In this paper, we consider representing documents in a topic space using well-known topic models such as latent Dirichlet allocation and latent semantic in...
Journal title:
- JASIS
Volume 46, Issue
Pages -
Publication date 1995